Bug 1789581: Fix osImageURL upgrade race #1357
/hold
Sad we're not 🦊 ing it...
f8fa322 to a64c9f7
@@ -686,6 +686,11 @@ func (optr *Operator) getOsImageURL(namespace string) (string, error) {
	if err != nil {
		return "", err
	}
	releaseVersion := cm.Data["releaseVersion"]
	optrVersion, _ := optr.vStore.Get("operator")
	if releaseVersion != optrVersion {
Are there any valid cases where the configmap from a previous setup would either be unversioned or not match while transitioning to versioned?
Previous configmaps won't be versioned, but that's fine: because Golang doesn't have Option<>, the `releaseVersion` will be the empty string, which won't match. That is what we want: we'll ignore the previous unversioned configmap.
Note that there would still be a possible race where the 4.2 MCO reads the new 4.3 osimageurl configmap and generates a new rendered MachineConfig with a newer osimageurl. That's why I believe we need to patch 4.2 as well to take versioning into account.
> notice there would still be a possible race where the 4.2 MCO reads the new 4.3 osimageurl configmap and could generate a new rendered mc with a newer osimageurl, that's why I believe we need to patch 4.2 as well to take versioning into account
I would rather we address that race condition directly than try to patch 4.2, which requires us to enforce that 4.2.x (unpatched) upgrades through 4.2.x+1 (patched) before going to 4.3. That's possible in theory but seems cumbersome in practice.
Some thoughts:
- Version the key in the `osimageurl` configmap (e.g. `osImageURL-0.0.1-snapshot: ...`) so that each MCO release has its own image reference. This would address this upgrade issue by changing the key name in 4.3 and protects us from any future upgrade race-conditions and failures. It would also allow you to do some weird cross-version testing, like if you wanted to have multiple MCOs referencing different osimages for some reason.
- Set the osimageurl value on the MCO deployment directly - it can be a flag passed in to the operator, the release infra will replace the image reference in the deployment manifest the same as it does in the configmap. If there are other reasons for using a configmap, the MCO deployment can be changed to reference a configmap mounted in just like all the other images. This would reduce any version issues to mismatches in the CC/MCC which we've dealt with before.
- Make MCO updates atomic by putting everything that always and only updates together together. The osimageurl, MCO, MCC controllers, and MC templates. See: Simplify controller architecture, merge "operator" and "controller" #878
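The first suggestion (a per-release key) could look roughly like the sketch below. It assumes the configmap data is a `map[string]string` and that keys follow the `osImageURL-<version>` pattern from the example above; `versionedKey` and `lookupOsImageURL` are hypothetical helpers, not existing MCO functions.

```go
package main

import "fmt"

// versionedKey builds the per-release key name, e.g. "osImageURL-4.3.0".
func versionedKey(version string) string {
	return "osImageURL-" + version
}

// lookupOsImageURL reads only the entry that belongs to the given operator
// version, so an older operator never picks up a newer release's image.
func lookupOsImageURL(data map[string]string, version string) (string, bool) {
	url, ok := data[versionedKey(version)]
	return url, ok
}

func main() {
	// Placeholder image reference, not a real one.
	data := map[string]string{
		"osImageURL-4.3.0": "registry.example/os@sha256:new",
	}
	for _, v := range []string{"4.2.0", "4.3.0"} {
		url, ok := lookupOsImageURL(data, v)
		fmt.Printf("version=%s found=%t url=%q\n", v, ok, url)
	}
}
```

With this layout a 4.2 operator reading the 4.3 configmap simply finds no entry for its own version, rather than a mismatched one it has to detect and reject.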
I haven't traced through the code and the logs from the race, but I completely believe you have, and this looks highly likely to fix the problem. I looked at previous variants of this like #1198, and the original was 835278f.
Looking over the manifests we install, I think this is the last thing that is input to the MachineConfig that isn't versioned. The other CVO-installed manifests are pure kube-level stuff like RBAC rules, etc.
/approve
a64c9f7 to c5c8456
Prometheus was not happy.
/test ci/prow/e2e-aws
/test ci/prow/e2e-aws
/test e2e-aws
This patch also guards against unversioned osImageURL config maps in the future. Signed-off-by: Antonio Murdaca <[email protected]>
Signed-off-by: Antonio Murdaca <[email protected]>
c5c8456 to b70a74a
@runcom: This pull request references Bugzilla bug 1789581, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/retest Please review the full test history for this PR and help us cut down flakes. |
@runcom: All pull requests linked via external trackers have merged. Bugzilla bug 1789581 has been moved to the MODIFIED state. In response to this:
This patch does two things:
- makes sure we create the osimageurl configmap before the new MCO deployment manifest is applied

NOTE: it might be necessary to version the ControllerConfig as well (???)

Related bug: https://bugzilla.redhat.com/show_bug.cgi?id=1786993